Dealing with collinearity in
quantile regression





Domenico Vistocco

@ Department of Political Sciences

University of Naples Federico II

Naples, ITALY


domenicovistocco.it
Google Scholar
ResearchGate
GitHub @domenicovistocco

Before we begin

  • This talk is a joint work with:
    • Cristina DAVINO (University of Naples, Federico II, Italy)
    • Rosaria ROMANO (University of Naples, Federico II, Italy)
    • Tormod NAES (Nofima Research Center, Norway)
  • A special thanks to Lea Petrella and the other members of the Scientific and Organizing Committee (Marco Geraci, Nicola Salvati, Beatrice Foroni, Nikos Tzavidis, Luca Merlo, Mila Andreani)

A well-known problem

Collinearity
the presence of strong correlations between predictors, a situation common to most regression applications
Effects
The effects of collinearity on least squares (LS) estimates are well investigated; this is not the case for quantile regression (QR)

Proposal: regression on latent components as a possible solution to collinearity in QR

Case study: assessment of the quality of service in presence of highly correlated predictors

“Simulation” study: analysis of different degrees of collinearity and various response distributions

When \(\mathbf{X}^\prime \mathbf{X}\) is nonsingular:

MLR model
\[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}\]
LS regression coefficients
\[ \boldsymbol{\hat{\beta}} = (\mathbf{X}^\prime \mathbf{X})^{-1} \mathbf{X}^\prime \mathbf{y} \]

The covariance matrix of the LS estimator is: \[ cov(\boldsymbol{\hat{\beta}})= \sigma^{2}(\mathbf{X}^\prime \mathbf{X})^{-1} \] and can also be formulated in terms of the singular value decomposition of the \(\mathbf{X}^\prime \mathbf{X}\) matrix: \[ cov(\boldsymbol{\hat{\beta}})= \sigma^{2}\sum_{a=1}^{A}\mathbf{p}_{a}(1/\lambda_{a})\mathbf{p}_{a}^\prime \] where the \(\mathbf{p}_{a}\)s and \(\lambda_{a}\)s are the eigenvectors and the eigenvalues of \(\mathbf{X}^\prime \mathbf{X}\).
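The equivalence of the two expressions for the covariance of the LS estimator can be checked numerically. The sketch below is illustrative: the design matrix and \(\sigma^2\) are arbitrary assumptions, not taken from the talk.

```python
import numpy as np

# Illustrative synthetic design matrix and noise variance (assumptions)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
sigma2 = 1.0

# Direct formula: cov(beta_hat) = sigma^2 (X'X)^{-1}
XtX = X.T @ X
cov_direct = sigma2 * np.linalg.inv(XtX)

# Spectral formula: sigma^2 * sum_a p_a (1/lambda_a) p_a'
lam, P = np.linalg.eigh(XtX)  # eigenvalues lambda_a and eigenvectors p_a of X'X
cov_spectral = sigma2 * sum(np.outer(P[:, a], P[:, a]) / lam[a]
                            for a in range(len(lam)))

assert np.allclose(cov_direct, cov_spectral)
```

The spectral form makes the role of the eigenvalues explicit: a tiny \(\lambda_a\) contributes a huge \(1/\lambda_a\) term to the variances.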

Exact linear relationships
  • one predictor is an exact linear combination of some others
  • \(\mathbf{X}^\prime \mathbf{X}\) becomes singular and no unique \(\boldsymbol{\hat{\beta}}\) can be produced
Nearly linear relationships
  • \(\mathbf{X}^\prime \mathbf{X}\) is nearly singular and the estimation equation for the regression parameters is ill–conditioned
  • parameter estimates \(\boldsymbol{\hat{\beta}}\) will be unstable
  • the variances of the regression coefficients become very large
  • equation above suggests the same conclusion but in terms of the eigenvalues of \(\mathbf{X}^\prime \mathbf{X}\)

Variance Inflation Factor

\(VIF_{j}=\frac{1}{1-R^{2}_{j}}\)

  • factor of increase of the estimator’s variance due to the correlation between \(\mathbf{x_j}\) and the other explanatory variables (computable for each predictor \(\mathbf{x_j}\))
  • \(R^{2}_{j}\) is the squared correlation coefficient obtained by predicting \(\mathbf{x_j}\) with the remaining explanatory variables
  • multicollinearity is present if one of the \(R^{2}_{j}\) is close to \(1\); generally, a VIF of \(10\) or above indicates that (multi)collinearity is a problem
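A minimal sketch of the VIF computation from its definition; the synthetic predictors (with one highly correlated pair) are an assumption made for illustration only.

```python
import numpy as np

# Illustrative data: x1 and x2 are nearly collinear, x3 is independent
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)      # strongly correlated with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2_j), regressing x_j on the remaining columns."""
    y = X[:, j]
    Z = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(Z)), Z])        # add intercept
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

print([round(vif(X, j), 1) for j in range(3)])  # VIFs for x1 and x2 are large
```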

CN - Condition Number

\(CN=\left(\frac{\hat{\lambda}_1}{\hat{\lambda}_{J}}\right)^{1/2}\)

  • \(\hat{\lambda}_1\) and \(\hat{\lambda}_{J}\) are the largest and the smallest eigenvalue of the empirical covariance matrix of \(\mathbf{X}\)
  • an informal rule of thumb is that multicollinearity is a concern if the condition number exceeds \(15\), and a severe concern if it exceeds \(30\)
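The condition number can be computed directly from the eigenvalues of the empirical covariance matrix; the near-collinear pair below is an illustrative assumption.

```python
import numpy as np

# Illustrative near-collinear pair of predictors
rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
x2 = x1 + 0.05 * rng.normal(size=500)
X = np.column_stack([x1, x2])

# Eigenvalues of the empirical covariance matrix of X
lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
cn = np.sqrt(lam.max() / lam.min())
print(round(cn, 1))  # far above the rule-of-thumb threshold of 30
```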

In a nutshell:

Conditional quantile model
\[ Q_{\tau}(\hat{\textbf{y}} \vert \textbf{X})=\textbf{X}\hat{\beta}(\tau) \]
QR regression coefficients
\[\begin{eqnarray} \hat{\beta}\left(\tau\right) &=& \underset{\beta(\tau)} {\mathrm{argmin}} ~\sum_{i=1}^n \rho_{\tau}\left(y_i-\textbf{x}^{\top}_i\beta(\tau) \right) \\ &=& \underset{\beta(\tau)} {\mathrm{argmin}} ~ \sum_{y_i < \textbf{x}^{\top}_i\beta(\tau) } (1-\tau) \left|y_i- \textbf{x}^{\top}_i\beta(\tau) \right| + \sum_{y_i\geq \textbf{x}^{\top}_i\beta(\tau) } \tau \left|y_i-\textbf{x}^{\top}_i\beta(\tau) \right| \end{eqnarray}\]
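The equivalence of the two formulations of the objective can be verified numerically via the check function \(\rho_{\tau}(u) = u(\tau - I(u<0))\). The sketch below uses illustrative synthetic data and also shows that minimizing the \(\tau\)-loss over a constant recovers the sample \(\tau\)-quantile.

```python
import numpy as np

def rho(u, tau):
    """Check function rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (u < 0))

# The two formulations of the objective coincide term by term:
u = np.linspace(-3, 3, 101)
tau = 0.3
split = np.where(u < 0, (1 - tau) * np.abs(u), tau * np.abs(u))
assert np.allclose(rho(u, tau), split)

# Minimizing sum_i rho_tau(y_i - q) over a constant q yields the
# tau-th sample quantile (grid search; data are illustrative):
y = np.random.default_rng(3).exponential(size=1000)
grid = np.linspace(y.min(), y.max(), 2000)
q_hat = grid[np.argmin([rho(y - q, tau).sum() for q in grid])]
print(q_hat, np.quantile(y, tau))       # nearly equal
```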

Note

Since the asymptotic distribution of the QR estimator depends on the inverse of the variance covariance matrix, the variance of the QR estimator increases with the degree of correlation among the predictors

Well-known solutions

  • subset selection

    • reduce the number of predictors to compress the model variance
    • improve interpretability
  • stepwise selection

    • forward stepwise selection
    • backward stepwise selection

Note

  • due to the different degrees of freedom of the competing models, an estimate of the test error is required to make the choice

  • alternatively, training error measures adjusted for model complexity can be used (Mallows' \(C_p\), \(AIC\), \(BIC\), adjusted \(R^2\))

  • shrinkage methods keep the \(p\) predictors in the model, but the coefficient estimates are shrunken towards 0

  • this also compresses the variance

  • some of the coefficients may be shrunk exactly to zero, implicitly excluding the corresponding predictors from the model

  • the type of constraint defines the shrinkage method

    • \(L_2\) constraint: ridge regression
    • \(L_1\) constraint: lasso regression
    • \(L_1\) and \(L_2\) constraint: elastic net

Note

Ridge regression assigns similar coefficients to correlated predictors, whereas lasso regression assigns quite different coefficients to correlated regressors

  • Methods that construct \(m\) new predictor variables (components) as linear combinations of the original predictor variables

  • Setting \(m < p\) reduces the model complexity

Principal component regression - PCR
the components are obtained with the aim of explaining the observed variability in the predictor variables, without considering the response variable at all
Partial least squares regression - PLSR
the components are obtained taking the response variable into account, which often leads to models that fit the response variable with fewer components

Our choice

Principal components analysis (PCA) is applied to the matrix of predictors \(\mathbf{X}\) in order to extract the \(m\) most dominating principal components \[ \mathbf{X} = \mathbf{TP}^\prime + \mathbf{E} \] where \(\mathbf{T}\) is called the scores matrix and collects the \(m\) dimensions responsible for the systematic variation in \(\mathbf{X}\). Then, \[ \mathbf{y} = \mathbf{Tq} + \mathbf{f} \] where \(\mathbf{P}\) and \(\mathbf{q}\) are called loadings and describe how the variables in \(\mathbf{T}\) relate to the original variables in \(\mathbf{X}\) and in \(\mathbf{y}\), respectively

The estimated scores \(\mathbf{\hat{T}}\) are used in the regression equation in place of the original predictors, where LSR is used to estimate the regression coefficients in \(\mathbf{q}\), and \(\mathbf{f}\) corresponds to the error term.

The PCR solution, i.e. the loadings \(\mathbf{\hat{P}}\) and the regression coefficients \(\mathbf{\hat{q}}\), can be combined to give the regression equation: \[ \hat{y}=\bar{y} + \mathbf{X}\hat{\mathbf{P}}\hat{\mathbf{q}}, \] which can be interpreted in the same way as a classical LSR and where the intercept is equal to the mean \(\bar{y}\) since the \(\mathbf{X}\) matrix is centered.
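A minimal numpy sketch of PCR as defined above: scores and loadings from the SVD of the centered \(\mathbf{X}\), LS on the scores, and the combined equation \(\hat{y}=\bar{y}+\mathbf{X}\hat{\mathbf{P}}\hat{\mathbf{q}}\). The data-generating process is an illustrative assumption.

```python
import numpy as np

# Illustrative data with one collinear pair of predictors
rng = np.random.default_rng(4)
n, p, m = 100, 4, 2                            # retain m < p components
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)   # collinear pair
y = X[:, 0] + rng.normal(size=n)

# Center X (the intercept then equals the mean of y)
Xc = X - X.mean(axis=0)
y_bar = y.mean()

# Scores T and loadings P from the SVD of the centered X
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P_hat = Vt[:m].T                               # loadings, p x m
T_hat = Xc @ P_hat                             # scores,   n x m

# LS regression of (centered) y on the scores
q_hat, *_ = np.linalg.lstsq(T_hat, y - y_bar, rcond=None)

# Combined regression equation: y_hat = y_bar + X P q
y_hat = y_bar + Xc @ P_hat @ q_hat
print(np.corrcoef(y, y_hat)[0, 1])
```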

  • If \(m = p\) the resulting model is equivalent to the LSR model
  • the variability of the estimates is larger for the last components, corresponding to the smaller eigenvalues
  • the directions with the smaller eigenvalues have a large impact on the variances if they are unstable: this is quite natural since they are dominated by noise
  • the variances of the PCR estimates are the same as for LSR, except that the influence of the eigenvalues after component \(m\) is eliminated
  • the representation of the regression coefficients on the principal components (loading plot) allows understanding which are the most critical variables in constructing the principal components and, therefore, in predicting the response variable
  • the representation of the scores (coordinates of the observations on the principal components) allows the detection of similarities and differences between the different statistical units and can be used for outlier detection

The extension of the principal component regression to the context of the QR is straightforward:

  • the extraction of the principal components from the predictor matrix occurs in the same way

  • the regression of the response variable on the extracted components uses the QR instead of the LSR

  • The estimated scores matrix \(\mathbf{\hat{T}}\) of QPCR is obtained by minimizing the loss function: \[ ||\mathbf{X} - \mathbf{TP^\prime}||^{2}, \] whose solution is obtained through the SVD of \(\mathbf{X}\)

  • The estimated scores \(\mathbf{\hat{T}}\) are then used in the regression equation in place of the original predictors \[ Q_{\tau}(\hat{\textbf{y}} \vert \textbf{T})=\textbf{T}\hat{\beta}(\tau), \]

  • estimation of separate models for different quantile levels \(\tau \in (0, 1)\)

  • a dense set of quantiles completely characterizes the conditional distribution of the response

  • the use of PCR allows obtaining a single set of “artificial predictors” that can be used for all the different conditional quantiles

  • this eases the interpretation, which can be complex for the other methods used to cope with multicollinearity

  • QPCR produces the same numerical and graphical outputs as PCR, with the only difference being that the results are specific for each selected conditional quantile
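The QPCR steps above can be sketched as follows. The quantile-regression fit is implemented here with the standard linear-programming formulation via `scipy.optimize.linprog`; the helper `quantile_reg` and the data-generating process are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_reg(X, y, tau):
    """QR via the standard LP formulation:
    min tau*1'u + (1-tau)*1'v  s.t.  X beta + u - v = y,  u, v >= 0."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

# Illustrative data with strong collinearity between the first two predictors
rng = np.random.default_rng(5)
n, p, m = 120, 3, 2
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)
y = 1 + X[:, 0] + rng.normal(size=n)

# Step 1: extract PCA scores from the centered predictor matrix
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
T_hat = Xc @ Vt[:m].T

# Step 2: one QR fit per quantile level, all on the same scores
D = np.column_stack([np.ones(n), T_hat])       # add intercept
for tau in (0.1, 0.5, 0.9):
    beta_tau = quantile_reg(D, y, tau)
    print(tau, np.round(beta_tau, 2))
```

The same scores matrix serves every \(\tau\), which is precisely what keeps the interpretation consistent across quantiles.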

Warm-up

Case study

  • data set regarding customers of a retailer that offers products both online and in–store

  • the data concern a random sample of 632 customers from a company's customer relationship management system

  • the collected variables are:

    • age (age), distance from the store (distance.to.store), on line visits (online.visits), transactions (online.trans) and total spending (online.spend) during a year, store transactions (store.trans) and total spending (store.spend) during a year, level of satisfaction with service (sat.service) and with the selection of products (sat.selection)

  • assess whether and to what extent the purchasing behaviour, the level of satisfaction with the seller and the personal characteristics of customers influence purchases made online

  • assess whether this impact changes according to the amount of money spent

Start

Simulation study

  • different degrees of correlation among predictors and different types of response to compare LSR and QR

  • a sample size of 100 observations and 3 predictors (only 1 of which is relevant for prediction)

  • the population model explains 70% of the variation in the response

  • the coefficient \(\gamma\) regulates the level of collinearity:

    • low values \(\rightarrow\) no or very low collinearity
    • higher values \(\rightarrow\) higher collinearity
  • for each value of the \(\gamma\) grid, the standard errors of LSR and QR models were computed using the bootstrap procedure in order to have a fair comparison

  • 1000 simulations for each value in the design grid

  • different types of responses

    • classical normal i.i.d. errors, and hence response
    • normal heteroscedastic errors
    • skewness in the response
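The qualitative effect of collinearity on bootstrap standard errors can be illustrated with a small sketch. The data-generating process below is a simplified stand-in for the simulation design: how \(\gamma\) enters, the coefficients, and the bootstrap size are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, B = 100, 200
gamma = 3.0  # collinearity knob; how gamma enters below is an assumption

# Simplified stand-in for the design: 3 predictors, the first two nearly
# collinear, only the first one relevant for the response
Z = rng.normal(size=(n, 3))
Z[:, 1] = Z[:, 0] + Z[:, 1] / (1 + gamma)
y = Z[:, 0] + rng.normal(size=n)
D = np.column_stack([np.ones(n), Z])

# Bootstrap standard errors of the LS slope estimates
betas = []
for _ in range(B):
    idx = rng.integers(0, n, n)
    b, *_ = np.linalg.lstsq(D[idx], y[idx], rcond=None)
    betas.append(b[1:])
se = np.std(np.array(betas), axis=0)
print(np.round(se, 3))  # SEs of the two collinear predictors are inflated
```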
Explained variance (%) by principal component:

                   PC1     PC2     PC3
\(\gamma = 0\)    36.33   33.87   29.80
\(\gamma = 0.5\)  45.44   34.16   20.40
\(\gamma = 1.0\)  55.19   36.37    8.44
\(\gamma = 1.5\)  65.17   30.71    4.12
\(\gamma = 2.0\)  66.17   31.93    1.90
\(\gamma = 2.5\)  72.16   27.03    0.81
\(\gamma = 3.0\)  85.03   14.74    0.23
\(\gamma = 3.5\)  91.79    8.12    0.09
\(\gamma = 4.0\)  95.43    4.54    0.03
\(\gamma = 4.5\)  97.47    2.52    0.01
\(\gamma = 5.0\)  97.87    2.13    0.00

Cumulative explained variance (%):

                   PC1     PC2     PC3
\(\gamma = 0\)    36.33   70.20  100.00
\(\gamma = 0.5\)  45.44   79.60  100.00
\(\gamma = 1.0\)  55.19   91.56  100.00
\(\gamma = 1.5\)  65.17   95.88  100.00
\(\gamma = 2.0\)  66.17   98.10  100.00
\(\gamma = 2.5\)  72.16   99.19  100.00
\(\gamma = 3.0\)  85.03   99.77  100.00
\(\gamma = 3.5\)  91.79   99.91  100.00
\(\gamma = 4.0\)  95.43   99.97  100.00
\(\gamma = 4.5\)  97.47   99.99  100.00
\(\gamma = 5.0\)  97.87  100.00  100.00

  • when the original regressors are used in the model, standard errors increase in value and in variability as the collinearity increases. This is more marked in QR than in LSR, and the effect is more pronounced in the extreme parts of the distribution (\(\tau=0.1\) and \(\tau=0.9\))
  • the previous consideration is amplified in the heteroscedastic case
  • when PCs are used in place of the original regressors, multicollinearity does not affect the variability of the estimates, both in the homoscedastic and in the heteroscedastic case
  • the distributions of the standard errors for \(X3\) and for the third PC coincide and the densities related to the two cases perfectly overlap
  • the variability of the estimators for \(X3\) and the third PC is more pronounced, especially in the heteroscedastic case
  • preliminary results

  • main limitation: the focus is on the effect of collinearity on standard errors

Food for thought

Wishlist

  • effect of the collinearity in terms of bias

  • QPCR evaluation in terms of bias at different locations

  • and then the effect on prediction ability


Thanks for
following along!


